This report evaluates the accuracy and precision of probabilistic forecasts submitted to the COVID-19 Forecast Hub over the last 8 weeks. The forecasts evaluated were submitted between November 03, 2020 and December 28, 2020. The data used for evaluation, including any revisions, are as of 2021-01-06.
In this weekly report we evaluate forecasts made for 57 locations (the US at the national level, the 50 states, and 6 territories), at 4 horizons, over 8 submission weeks. We evaluate 3 targets: incident cases, incident deaths, and cumulative deaths.
In collaboration with the US CDC, our team collects COVID-19 forecasts from dozens of teams around the globe. Each Monday evening or Tuesday morning, we combine the most recent forecasts from each team into a single "ensemble" forecast for each target.
Typically on Wednesday or Thursday of each week, a summary of the week's forecasts from the COVID-19 Forecast Hub, including the ensemble forecast, appears on the official CDC COVID-19 forecasting page.
This figure shows the number of incident cases reported each week. The period between the vertical lines marks the weeks for which models were evaluated.
The figure below shows the number of locations for which each model submitted forecasts during this evaluation period. Models that are eligible for evaluation, based on the number of weeks submitted and the number of targets for each week, are bolded. The dates listed on the x-axis are the Saturday before the first horizon; this is the Saturday associated with the target submission week. If a model is submitted Tuesday through Friday, the Saturday listed falls after the submission; if it is submitted on a Sunday or Monday, the Saturday falls before the submission date.
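This Saturday convention can be expressed as a short date helper. The sketch below is illustrative only; the function name `reference_saturday` is hypothetical and not part of the Hub's codebase.

```r
# Hypothetical helper illustrating the Saturday convention described above.
# Sunday/Monday submissions map back to the most recent Saturday; Tuesday
# through Friday submissions roll forward to the upcoming Saturday.
reference_saturday <- function(submission_date) {
  wd <- as.POSIXlt(submission_date)$wday  # 0 = Sunday, ..., 6 = Saturday
  if (wd %in% c(0, 1)) {
    submission_date - (wd + 1)  # previous Saturday
  } else {
    submission_date + (6 - wd)  # upcoming Saturday
  }
}

reference_saturday(as.Date("2020-11-09"))  # Monday  -> "2020-11-07"
reference_saturday(as.Date("2020-11-10"))  # Tuesday -> "2020-11-14"
```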
This figure shows the number of locations for which each model submitted an incident case forecast. The maximum number of locations is 57: a national-level forecast, all 50 states, and 6 US territories.
The number of models that submitted forecasts for incident cases is 39. The number of models that submitted forecasts for all 8 weeks was 35. The number of teams that submitted forecasts for all 57 locations was 8.
Each week, we generate a leaderboard table to assess the interval coverage, relative weighted interval scores (WIS), and relative mean absolute error (MAE) of each model. The data in this figure are aggregated across all submission weeks, locations, and horizons.
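For reference, the WIS used throughout this report is the standard weighted interval score for quantile forecasts; the summary below follows the usual definition rather than this report's code. For a forecast $F$ of an observed value $y$, with predictive median $m$ and central prediction intervals $[l_k, u_k]$ at levels $1 - \alpha_k$, $k = 1, \dots, K$:

$$
\mathrm{IS}_{\alpha}(F, y) = (u - l) + \frac{2}{\alpha}(l - y)\,\mathbf{1}(y < l) + \frac{2}{\alpha}(y - u)\,\mathbf{1}(y > u)
$$

$$
\mathrm{WIS}(F, y) = \frac{1}{K + 1/2}\left(\frac{1}{2}\,\lvert y - m\rvert + \sum_{k=1}^{K} \frac{\alpha_k}{2}\,\mathrm{IS}_{\alpha_k}(F, y)\right)
$$

Lower scores are better: the WIS rewards narrow prediction intervals and penalizes intervals that miss the observed value.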
For inclusion in this table, a team must have submitted a model for at least 4 of the last 8 weeks. A model was counted only if it included at least 25 locations and forecasts for the 1 through 4 week ahead horizons.
Well-calibrated models should have a 50% coverage level of 0.5 and a 95% coverage level of 0.95. The relative WIS and relative MAE scores are calculated based on a pairwise comparison developed by Johannes Bracher. The code for this comparison can be found here.
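As a concrete illustration of the coverage check, empirical interval coverage is simply the fraction of observations that fall inside the corresponding prediction interval. A minimal sketch, assuming a data frame `fc` with illustrative column names for the observed value and interval endpoints:

```r
# Empirical coverage: share of observations inside each central interval.
# `fc` is assumed to have columns truth, lower_50, upper_50, lower_95, upper_95.
coverage_50 <- mean(fc$truth >= fc$lower_50 & fc$truth <= fc$upper_50)
coverage_95 <- mean(fc$truth >= fc$lower_95 & fc$truth <= fc$upper_95)
```

For a well-calibrated model, `coverage_50` should be close to 0.5 and `coverage_95` close to 0.95.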
The relative WIS and relative MAE are calculated using a pairwise approach to account for variation in the difficulty of forecasting different weeks and locations. Models with a relative WIS or relative MAE lower than 1 are more accurate than the baseline, and models with a relative WIS greater than 1 are less accurate than the baseline at predicting the number of incident cases.
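The pairwise computation can be sketched as follows. This is an illustrative reimplementation under assumed column names (`model`, `location`, `target_end_date`, `wis`), not the Hub's actual code:

```r
# Ratio of model A's mean WIS to model B's, on forecasts both models made.
pairwise_ratio <- function(scores, model_a, model_b) {
  a <- scores[scores$model == model_a, ]
  b <- scores[scores$model == model_b, ]
  common <- merge(a, b, by = c("location", "target_end_date"),
                  suffixes = c("_a", "_b"))
  if (nrow(common) == 0) return(NA)
  mean(common$wis_a) / mean(common$wis_b)
}

relative_wis <- function(scores, baseline = "COVIDhub-baseline") {
  models <- unique(scores$model)
  # theta[i, j]: pairwise WIS ratio of model i against model j.
  theta <- outer(models, models,
                 Vectorize(function(i, j) pairwise_ratio(scores, i, j)))
  # Relative skill: geometric mean of each model's ratios against all others.
  skill <- exp(rowMeans(log(theta), na.rm = TRUE))
  names(skill) <- models
  skill / skill[baseline]  # rescale so the baseline model scores exactly 1
}
```

Because each model is compared only on the forecasts it shares with each other model, the geometric-mean rescaling reduces the advantage a model would otherwise gain by skipping difficult weeks or locations.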
In the following figures, we have evaluated the average WIS for models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all locations.
The first figure shows the mean WIS across all locations for each submission week at a 1-week horizon. The second figure also shows the mean WIS aggregated across locations, but at a 4-week horizon.
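The aggregation behind these figures amounts to a grouped mean. A minimal sketch, again assuming the illustrative `scores` data frame described above:

```r
library(dplyr)

# Mean WIS per model and submission week, at the 1- and 4-week horizons.
mean_wis <- scores %>%
  filter(horizon %in% c(1, 4)) %>%
  group_by(model, horizon, target_end_date) %>%
  summarise(mean_wis = mean(wis), .groups = "drop")
```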
To view a specific team, double-click on the team name in the legend. To view a value on the plot, click on the point of interest in the forecast.
In this figure, the dotted black line represents the average 1 week ahead error. There is larger variation in error for the 4 week horizon compared to the 1 week horizon.
The following figure shows the scores of models aggregated by horizon and submission week. In this figure, we have only included models that have submitted forecasts for all 4 horizons and all submission weeks evaluated. The color scheme shows the WIS score relative to the baseline.
The models included in this figure are: BPagano-RtDriven, CEID-Walk, Columbia_UNC-SurvCon, CovidAnalytics-DELPHI, COVIDhub-baseline, COVIDhub-ensemble, CU-nochange, CU-scenario_high, CU-scenario_low, CU-scenario_mid, CU-select, DDS-NBDS, IowaStateLW-STEM, JCB-PRM, JHU_CSSE-DECOM, JHUAPL-Bucky, Karlen-pypm, LANL-GrowthRate, LNQ-ens1, RobertWalraven-ESG, UCLA-SuEIR, and UCSB-ACTS.
In the 8-week evaluation period, the evaluated Saturdays are 2020-11-14 through 2021-01-02. The number of models that submitted forecasts for incident deaths is 50. The number of models that submitted forecasts for all 8 weeks was 49. The number of teams that submitted forecasts for all 57 locations was 10.
The figure below shows the number of locations for which each model submitted forecasts during this evaluation period. Models that are eligible for evaluation, based on the number of weeks submitted and the number of targets for each week, are bolded. The dates listed on the x-axis are the Saturday before the first horizon; this is the Saturday associated with the target submission week. If a model is submitted Tuesday through Friday, the Saturday listed falls after the submission; if it is submitted on a Sunday or Monday, the Saturday falls before the submission date.
The figure below shows the number of locations and weeks that each team has submitted forecasts for.
Each week, we generate a leaderboard table to assess the interval coverage, relative weighted interval scores (WIS), and relative MAE of each model.
The relative WIS and relative MAE are calculated using a pairwise approach, as described above, both to account for variation in the difficulty of forecasting different weeks and locations and to assess how accurate each model is compared to the baseline. Models with a relative WIS or relative MAE lower than 1 are more accurate than the baseline, and models with a relative WIS or relative MAE greater than 1 are less accurate than the baseline at predicting the number of incident deaths.
For inclusion in this table, a team must have submitted a model for at least 4 of the last 8 weeks and have submitted a forecast for every horizon (1 - 4 weeks ahead) in each week.
In the following figures, we have evaluated the average WIS for models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all locations. The first figure shows the mean WIS across all locations for each submission week at a 1-week horizon. The second figure also shows the mean WIS aggregated across locations, but at a 4-week horizon.
To view a specific team, double-click on the team name in the legend. To view a value on the plot, click on the point of interest in the forecast.
Finally, we have evaluated the locations for which each team had the lowest WIS scores. In this figure, models were included if they submitted forecasts for all submission weeks and all horizons. The WIS scores stratified by location are shown in each box. The color scheme shows the WIS score relative to the baseline.
The number of models that submitted forecasts for cumulative deaths is 51. The number of models that submitted forecasts for all 8 weeks was 51. The number of teams that submitted forecasts for all 57 locations was 11.
The figure below shows the number of locations and weeks that each team has submitted forecasts for.
This analysis only uses models that have submitted forecasts for at least 4 weeks and at least 25 locations.
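That filter can be sketched as a grouped summary, again with the illustrative `scores` columns used above:

```r
library(dplyr)

# Keep models with at least 4 distinct submission weeks and 25 locations.
eligible_models <- scores %>%
  group_by(model) %>%
  summarise(n_weeks = n_distinct(target_end_date),
            n_locations = n_distinct(location),
            .groups = "drop") %>%
  filter(n_weeks >= 4, n_locations >= 25) %>%
  pull(model)
```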